Summary

Quality control

I first check the quality of all samples by looking at the number of reads per cell, the number of features (genes) per cell, and the percent mitochondrial genes per cell. In cases where the cells were dying, we generally see a higher percent mitochondria. I like to see that most cells have less than 15% of reads mapping to mitochondrial reads.

Here it seems that most cells look pretty good. You have ~4,000-5,000 genes per cell which is amazing. I normally see closer to 3,000 genes per cell. There are definitely some cells with a really high mitochondiral percent. I filter these cells out before continuing. In general, there were fewer AT02 cells that were filtered out because of quality concerns.

My filters

  • percent.mt < 10 (all cells with more thant 10% mitochondiral reads are filtered out)
  • nFeature_RNA > 100 (any cells with less than 100 genes were excluded). I normally make this cutoff between 200-500, but that is too low for this dataset
  • nFeature_RNA < 1500 (there were a handful of cells with many features. These are generally removed as they are suspected to be doublets)
  • nCount_ADT < 1000 (Any cells with more than 1000 ADT reads were excluded. There were not many.)

healthy_bcells_1

RNA

if(sample_info$type == "single"){
  unprocessed_object <- readRDS(sample_info$seurat_unprocessed_path)
} else {
  unprocessed_object <- sample_info$seurat_object
}

Idents(unprocessed_object) <- "orig.ident"
rna_qual <- VlnPlot(unprocessed_object,
                    features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
                    ncol = 3)
rna_qual

median_counts <- median(unprocessed_object$nCount_RNA)
median_features <- median(unprocessed_object$nFeature_RNA)
cell_number <- nrow(unprocessed_object[[]])
Metrics
  • The median counts per cell is 848
  • The median genes per cell is 414
  • The total number of cells before filtering is 10684

ADT

adt_qual <-   adt_qual <- VlnPlot(unprocessed_object,
                                  features = c("nCount_ADT", "nFeature_ADT"))

adt_qual

median_counts_adt <- median(unprocessed_object$nCount_ADT)
median_features_adt <- median(unprocessed_object$nFeature_ADT)
Metrics
  • The median ADT counts per cell is 83
  • The median ADTs per cell is 4

Doublet Removal

I next identify likely doublets using DoubletFinder which attempts to model what doublets would look like based on a mixing of the different clusters. The doublets identified are shown below. I filter these out before continuing the analysis.

healthy_bcells_1

doublet_object <- readRDS(sample_info$seurat_doublet_path)

plotDimRed(doublet_object,
           col_by = "Doublet_finder",
           plot_type = "rna.umap")[[1]]

final_object <- sample_info$seurat_object
median_counts <- median(final_object$nCount_RNA)
median_features <- median(final_object$nFeature_RNA)
cell_number <- nrow(final_object[[]])

Metrics

  • The median counts per cell after all filtering is 824
  • The median genes per cell after all filtering is 406
  • The total number of cells after all filtering is 9775

PCA

I next do an initial dimensional reduction with using PCA on the top 2000 variable genes. The PCA is not too informative, but it’s worth looking at a few metrics.

  1. The sample - because there is only one sample, this is mostly just the shape
  2. The number of UMIs seen in each cell. There is some correlation between this and PC2 so we will need to make sure this isn’t captured when we make a UMAP.
  3. The number of genes seen in each cell. Again, there is some correlation between this and PC2 so we will need to make sure this isn’t captured when we make a UMAP.
  4. The percent mitochondrial reads per cell. We don’t want to see much correlation between this and either of the PCs.

healthy_bcells_1

seurat_object <- sample_info$seurat_object

plot1 <- plotDimRed(seurat_object, col_by = "orig.ident",
           plot_type = "pca", color = sample_colors)[[1]]
plot2 <- plotDimRed(seurat_object, col_by = c("nCount_RNA",
                                              "nFeature_RNA",
                                              "percent.mt"),
                    plot_type = "pca")
plot_grid(plot1, plot2[[1]], plot2[[2]], plot2[[3]],
          labels = c("A", "B", "C", "D"),
          align = "hv", axis = "lr")

UMAP

I next follow the PCA with a UMAP dimensional reduction. Rather than use genes for UMAP, we use the top PCs (~30 for all of these samples). I’ll plot the same metrics for UMAP as for the PCA.

  1. The sample - because there is only one sample, this is mostly just the shape
  2. The number of UMIs seen in each cell. There is some correlation between this and the UMAP. It could be technical or cell type specific. We can decided this based on the gene expression patterns.
  3. The number of genes seen in each cell. There is a little less correlation here, but we do still see a correlation.
  4. The percent mitochondrial reads per cell. There isn’t much correlation here which is a good sign.
  5. The clusters identified by my analysis - you can see that there isn’t a lot of structure either in the UMAP or the clusters.

healthy_bcells_1

seurat_object <- sample_info$seurat_object
cluster_colors <- sample_info$cluster_colors

plot1 <- plotDimRed(seurat_object, col_by = "orig.ident",
           plot_type = "rna.umap", color = sample_colors)[[1]]
plot2 <- plotDimRed(seurat_object, col_by = c("nCount_RNA",
                                              "nFeature_RNA",
                                              "percent.mt"),
                    plot_type = "rna.umap")
plot3 <- plotDimRed(seurat_object, col_by = "RNA_cluster",
                    color = cluster_colors, plot_type = "rna.umap")[[1]]
plot_grid(plot1, plot2[[1]], plot2[[2]], plot2[[3]], plot3,
          labels = c("A", "B", "C", "D", "E"),
          nrow = 3, ncol = 2,
          align = "hv", axis = "lr")

Celltypes

I named celltypes based using clustifyr which uses a reference dataset to name clusters. First, average expression is found for each cluster in the reference dataset, then a correlation is run between the reference cluster and our clusters. I used The human single cell atlas as the reference.

  1. UMAP colored by clusters
  2. UMAP colored by cell type defined within each individual sample
  3. Barplot showing the percent of cells that were assigned to each cell type.

healthy_bcells_1

seurat_object <- sample_info$seurat_object

if(sample_info$type == "single"){
  plot1 <- plotDimRed(seurat_object, col_by = "RNA_cluster",
           plot_type = "rna.umap")[[1]]
  plot2 <- plotDimRed(seurat_object,
                      col_by = "RNA_celltype", plot_type = "rna.umap",
                      color = celltype_colors)[[1]]
  barplot1 <- stacked_barplots(seurat_object = seurat_object,
                               meta_col = "RNA_celltype",
                               percent = TRUE,
                               color = celltype_colors)
  plot_grid(plot1, plot2, NULL, NULL, barplot1,
            labels = c("A", "B", "", "", "C"),
            align = "hv", axis = "lr",
            nrow = 2, ncol = 2)
} else {
  plot1 <- plotDimRed(seurat_object, col_by = "uncorrected_cluster",
           plot_type = "rna.umap")[[1]]
  plot2 <- plotDimRed(seurat_object,
                      col_by = c("RNA_celltype",
                                 "RNA_combined_celltype"),
                      plot_type = "rna.umap",
                      color = celltype_colors)
  barplot1 <- stacked_barplots(seurat_object = seurat_object,
                               meta_col = "RNA_combined_celltype",
                               percent = TRUE,
                               color = celltype_colors,
                               split_by = "sample")
  plot_grid(plot1, plot2[[1]], plot2[[2]], NULL, barplot1,
            labels = c("A", "B", "C", "", "D"),
            align = "hv", axis = "lr",
            nrow = 3, ncol = 2)
}

Heatmaps of marker genes

I next found differentially expressed genes between all of the clusters and cell types to see if this agrees with the cell type naming above. Overall, I’d agree with the cell types. I’m only showing the top 10 markers of each cell type/cluster below.

Because the sequencing depth is so low, I’d take these DE genes with a grain of salt.

Cell types

  1. Left, expression of DE genes in cell types. The x-axis is colored by celltype. Because most of the cells are B cells, this plot is not very appealing. Additionally, the low sequencing depth (and high dropout rate) makes this plot look very sparse.
  2. Right, expression of DE genes in cell types averaged within each cell type. The x axis is colored by cell type. Because this is average expression, it looks much better.
sample_info <- sample_list[["healthy_bcells_1"]]
seurat_data <- sample_info$seurat_object
save_dir <- sample_info$save_dir

# Get marker genes that are saved
marker_genes_rna <- read.csv(file.path(save_dir,
                                    "files/DE/rna_markers_RNA_celltype.csv"),
                             row.names = 1)

# Pick out the top 10 makers of each cluster
top10_rna <- marker_genes_rna %>% 
  dplyr::group_by(cluster) %>% 
  dplyr::top_n(n = 10, wt = avg_log2FC) %>%
  dplyr::arrange(cluster)

# Make a heatmap with these genes
rna_heatmap <- plot_heatmap(seurat_data, top10_rna$gene, "RNA_celltype",
                            average_expression = FALSE)[[4]]

# Make a heatmap using the average values from each cluster
rna_heatmap_ave <- plot_heatmap(seurat_data, top10_rna$gene, "RNA_celltype",
                            average_expression = TRUE)[[4]]


plot_grid(rna_heatmap, rna_heatmap_ave,
          nrow = 1, ncol = 2,
          align = "hv",
          axis = "tb")

Clusters

  1. Left, expression of DE genes in clusters. I’ve colored the x axis by the cluster. Again, this data is pretty sparse so the single cell heatmap doesn’t look great.
  2. Right, expression of DE genes in clusters averaged within each cluster. I’ve colored bye x axis by the cluster
celltype_cluster_colors <- sample_info$celltype_cluster_colors

# Get marker genes that are saved
marker_genes_rna <- read.csv(file.path(save_dir,
                                    "files/DE/RNA_markers_celltype_cluster.csv"),
                             row.names = 1)

# Pick out the top 10 makers of each cluster
top10_rna <- marker_genes_rna %>% 
  dplyr::group_by(cluster) %>% 
  dplyr::top_n(n = 10, wt = avg_log2FC) %>%
  dplyr::arrange(cluster)

# Make a heatmap with these genes
rna_heatmap <- plot_heatmap(seurat_data, top10_rna$gene, "celltype_cluster",
                            average_expression = FALSE,
                            colors = celltype_cluster_colors)[[4]]

# Make a heatmap using the average values from each cluster
rna_heatmap_ave <- plot_heatmap(seurat_data, top10_rna$gene, "celltype_cluster",
                                average_expression = TRUE,
                                colors = celltype_cluster_colors)[[4]]



plot_grid(rna_heatmap, rna_heatmap_ave,
          nrow = 1, ncol = 2,
          align = "hv",
          axis = "tb")

Plots of ADTs

healthy_bcells_1

merged_seurat <- sample_info$seurat_object
cluster_colors <- sample_info$cluster_colors

sep_1 <- violin_col_by_1 <- "RNA_cluster"
colors_1 <- cluster_colors
umap_cols_1 <- colors_1

if(sample_info$type == "combined"){
  sep_2 <- violin_col_by_2 <- "RNA_combined_celltype"
} else {
  sep_2 <- violin_col_by_2 <- "RNA_celltype"
}
colors_2 <- umap_cols_2 <- celltype_colors

plot_type = "rna.umap"

extra_pound_gene <- paste0(extra_pound_gene_all, "#")

gene_chunks <- genes %>%
  map(~knit_expand(file = gene_template, sample = sample, group = group,
                   gene = .x, subset = FALSE))

IgM

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

IgD

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

CXCR5.1

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

CD21

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

CD27.1

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

Plots of RNA

healthy_bcells_1

merged_seurat <- sample_info$seurat_object
cluster_colors <- sample_info$cluster_colors

sep_1 <- violin_col_by_1 <- "RNA_cluster"
colors_1 <- cluster_colors
umap_cols_1 <- colors_1

if(sample_info$type == "combined"){
  sep_2 <- violin_col_by_2 <- "RNA_combined_celltype"
} else {
  sep_2 <- violin_col_by_2 <- "RNA_celltype"
}
colors_2 <- umap_cols_2 <- celltype_colors

plot_type = "rna.umap"

extra_pound_gene <- paste0(extra_pound_gene_all, "#")

gene_chunks <- genes %>%
  map(~knit_expand(file = gene_template, sample = sample, group = group,
                   gene = .x, subset = FALSE))

CR2

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

CD27

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

CXCR5

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

IGHD

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

IGHM

Plots of gene expression

  1. Plot of gene expression on single cell UMAP

  2. Plot of RNA_cluster in all cells

  3. Violin plot of RNA_cluster in all cells

  4. Plot of RNA_celltype in all cells

  5. Violin plot of RNA_celltype in all cells

Density plots ADT

I next looked at a density plot of each protein to see if there was any patterns that could be used to find cutoffs for identifying your subset. While there is certainly structure in the IgM and IgG that I can see, I don’t see as clear of cutoffs for CD21 and CXCR5.

healthy_bcells_1

seurat_object <- sample_info$seurat_object

if (group == "rna"){
  plotting_df <- GetAssayData(seurat_object, slot = "data", assay = "RNA") %>%
    as.matrix %>%
    t %>%
    data.frame %>%
    dplyr::select(all_of(plotting_list))
} else if (group == "adt"){
  plotting_df <- GetAssayData(seurat_object, slot = "data", assay = "CLR_ADT") %>%
    as.matrix %>%
    t %>%
    data.frame %>%
    dplyr::select(all_of(plotting_list))
} else {
  stop("Group must be adt or rna")
}

plotting_df <- cbind(plotting_df, seurat_object[["RNA_cluster"]])

plotting_df_long <- plotting_df %>%
  tidyr::pivot_longer(cols = all_of(plotting_list),
                      names_to = "protein",
                      values_to = "expression")
  
plot1 <- ggplot2::ggplot(data = plotting_df_long,
                         mapping = ggplot2::aes(expression, color = protein)) + 
  ggplot2::geom_density() + 
  ggplot2::scale_color_manual(values = protein_colors)


plot2 <- ggplot2::ggplot(data = plotting_df_long,
                         mapping = ggplot2::aes(x = expression,
                                                color = protein,
                                                y = protein,
                                                fill = protein)) +
  ggridges::geom_density_ridges() + 
#  ggridges::theme_ridges()
  ggplot2::scale_color_manual(values = protein_colors) +
  ggplot2::scale_fill_manual(values = protein_colors)

plot_grid(plot1, plot2,
          labels = c("A", "B"),
          nrow = 1, ncol = 2,
          align = "hv", axis = "lr")

Density plots ADT no CD27

I have repeated the density plots without CD27 so it is easier to visualize patterns.

healthy_bcells_1

seurat_object <- sample_info$seurat_object

if (group == "rna"){
  plotting_df <- GetAssayData(seurat_object, slot = "data", assay = "RNA") %>%
    as.matrix %>%
    t %>%
    data.frame %>%
    dplyr::select(all_of(plotting_list))
} else if (group == "adt"){
  plotting_df <- GetAssayData(seurat_object, slot = "data", assay = "CLR_ADT") %>%
    as.matrix %>%
    t %>%
    data.frame %>%
    dplyr::select(all_of(plotting_list))
} else {
  stop("Group must be adt or rna")
}

plotting_df <- cbind(plotting_df, seurat_object[["RNA_cluster"]])

plotting_df_long <- plotting_df %>%
  tidyr::pivot_longer(cols = all_of(plotting_list),
                      names_to = "protein",
                      values_to = "expression")
  
plot1 <- ggplot2::ggplot(data = plotting_df_long,
                         mapping = ggplot2::aes(expression, color = protein)) + 
  ggplot2::geom_density() + 
  ggplot2::scale_color_manual(values = protein_colors)


plot2 <- ggplot2::ggplot(data = plotting_df_long,
                         mapping = ggplot2::aes(x = expression,
                                                color = protein,
                                                y = protein,
                                                fill = protein)) +
  ggridges::geom_density_ridges() + 
#  ggridges::theme_ridges()
  ggplot2::scale_color_manual(values = protein_colors) +
  ggplot2::scale_fill_manual(values = protein_colors)

plot_grid(plot1, plot2,
          labels = c("A", "B"),
          nrow = 1, ncol = 2,
          align = "hv", axis = "lr")

Density plots RNA

I also tried with the RNA, but there wasn’t much expression. This will probably get better with deeper sequencing.

healthy_bcells_1

seurat_object <- sample_info$seurat_object

if (group == "rna"){
  plotting_df <- GetAssayData(seurat_object, slot = "data", assay = "RNA") %>%
    as.matrix %>%
    t %>%
    data.frame %>%
    dplyr::select(all_of(plotting_list))
} else if (group == "adt"){
  plotting_df <- GetAssayData(seurat_object, slot = "data", assay = "CLR_ADT") %>%
    as.matrix %>%
    t %>%
    data.frame %>%
    dplyr::select(all_of(plotting_list))
} else {
  stop("Group must be adt or rna")
}

plotting_df <- cbind(plotting_df, seurat_object[["RNA_cluster"]])

plotting_df_long <- plotting_df %>%
  tidyr::pivot_longer(cols = all_of(plotting_list),
                      names_to = "protein",
                      values_to = "expression")
  
plot1 <- ggplot2::ggplot(data = plotting_df_long,
                         mapping = ggplot2::aes(expression, color = protein)) + 
  ggplot2::geom_density() + 
  ggplot2::scale_color_manual(values = protein_colors)


plot2 <- ggplot2::ggplot(data = plotting_df_long,
                         mapping = ggplot2::aes(x = expression,
                                                color = protein,
                                                y = protein,
                                                fill = protein)) +
  ggridges::geom_density_ridges() + 
#  ggridges::theme_ridges()
  ggplot2::scale_color_manual(values = protein_colors) +
  ggplot2::scale_fill_manual(values = protein_colors)

plot_grid(plot1, plot2,
          labels = c("A", "B"),
          nrow = 1, ncol = 2,
          align = "hv", axis = "lr")

Gating based on Zach’s input

ADT_data <- GetAssayData(seurat_data, assay = "CLR_ADT", slot = "data") %>%
  t %>%
  data.frame

# CD27 - from Zach use top 20%
cd27_cutoff <- quantile(ADT_data$CD27.1, probs = .8)

# IgM - from zach bottom 20%
igm_cutoff <- quantile(ADT_data$IgM, prob = 0.2)

# IgD cutoff - a little less than 1, at the plateau
igd_cutoff <- quantile(ADT_data$IgD, prob = 0.5)

# CD21 cutoff CD21 Bottom 10-15% Maybe the bump around 0.5?
cd21_cutoff <- quantile(ADT_data$CD21, prob = 0.1)

ADT_data_long <- ADT_data %>%
  tidyr::pivot_longer(cols = all_of(colnames(.)), names_to = "protein",
                                    values_to = "value")

IgM

# Plots of cutoffs
IgM_plot <- ADT_data_long %>%
  dplyr::filter(protein == "IgM") %>%
  ggplot2::ggplot(ggplot2::aes(x = value, fill = protein)) +
    ggplot2::geom_density() +
    ggplot2::scale_fill_manual(values = protein_colors_adt) + 
    ggplot2::geom_vline(xintercept = igm_cutoff)

print(IgM_plot)

IgD

# Plots of cutoffs
IgD_plot <- ADT_data_long %>%
  dplyr::filter(protein == "IgD") %>%
  ggplot2::ggplot(ggplot2::aes(x = value, fill = protein)) +
    ggplot2::geom_density() +
    ggplot2::scale_fill_manual(values = protein_colors_adt) + 
    ggplot2::geom_vline(xintercept = igd_cutoff)

print(IgD_plot)

CD21

# Plots of cutoffs
CD21_plot <- ADT_data_long %>%
  dplyr::filter(protein == "CD21") %>%
  ggplot2::ggplot(ggplot2::aes(x = value, fill = protein)) +
    ggplot2::geom_density() +
    ggplot2::scale_fill_manual(values = protein_colors_adt) + 
    ggplot2::geom_vline(xintercept = cd21_cutoff)

print(CD21_plot)

Cells that match your cutoffs

I next found the cells that matched each cutoff (low IgM, high IgD, low CD21) and counted the number of cells that passed each filter. The scores are 0-3 where 0 means the cells didn’t pass any filter and 3 means the cell passed all filters.

There were only 42 cells that passed all of the filters

table(seurat_data$pass_filters)

   0    1    2    3 
3099 5576 1058   42 

And they didn’t seem to align to any particular cluster (except maybe the T cell cluster?)

  1. The UMAP colored by the number of protein cutoffs met by the cell
  2. The UMAP with only cells that matched all cutoffs colored
  3. The UMAP with only cells that matched all cutoffs colored by cluster
plot_1 <- plotDimRed(seurat_data, "pass_filters", plot_type = "rna.umap")[[1]]

plot_2 <- plotDimRed(seurat_data, "pass_filters",
                     plot_type = "rna.umap", highlight_group = TRUE,
                     group = 3, meta_data_col = "pass_filters")[[1]]

plot_3 <- plotDimRed(seurat_data, "celltype_cluster",
                     plot_type = "rna.umap", highlight_group = TRUE,
                     group = 3, meta_data_col = "pass_filters")[[1]]

plot_grid(plot_1, plot_2, plot_3, align = "hv", axis = "lr",
          nrow = 2, ncol = 2, labels = c("A", "B", "C"))

Below I’m showing the number of cells that match your cutoffs in each cluster. * celltype_cluster is the name of the cluster with the cell type appended to make it easier to see * n is the number of cells within that cluster that passed all cutoffs * total is the total number of cells in the cluster * fraction is n/total

seurat_data[[]] %>%
  dplyr::select(celltype_cluster, pass_filters) %>%
  dplyr::group_by(celltype_cluster, .drop = FALSE) %>%
  dplyr::count(pass_filters) %>%
  dplyr::group_by(celltype_cluster, .drop = FALSE) %>%
  dplyr::mutate(total = sum(n)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(fraction = n/total) %>%
  dplyr::filter(pass_filters == 3) %>%
  dplyr::select(celltype_cluster, n, total, fraction)